Abstract

This paper presents, for the first time, a comprehensive statistical
characterization of a Ukrainian novel. The distribution of word-forms
with respect to their length is studied. The linguistic laws of
Zipf-Mandelbrot and Altmann-Menzerath are analyzed.


1 Introduction 


The year 2006 marks the 150th anniversary of the birth of Ivan Franko
(1856-1916), the prominent Ukrainian writer, poet, publicist, philosopher,
sociologist, economist, translator-polyglot and public figure. His
incomplete collected works were published in 50 volumes (Franko, 1976-86).
The notion of national identity in Western Ukraine is connected with his
name. Franko’s works have intense plots and engaging subjects. In this
paper, we analyze the novel Перехресні стежки (The Cross-Paths, also
referred to as The Crossroads in Encyclopaedia Britannica).

The events of the novel unfold at the turn of the 20th century. It is
the story of a young lawyer, Eugenij Rafalovyc, who comes to a provincial
town in Halytchyna (Galicia) to continue his practice. There he meets his
former teacher, Stalski, who tells Eugenij about his married life and, in
particular, that he has not spoken to his wife at all for ten (!) years as
a punishment. The point is that Stalski turns out to be married to Regina,
Eugenij’s Jugendliebe... The novel is about lawlessness and justice,
meanness and nobleness, consciousness and subconsciousness. Social motives,
psychologism, love and tragedy are tangled here in an intricate way. The
novel was translated into French (Franko, 1989; see also some excerpts in
Anthologie, 2001) and into Russian (Franko, 1956).


The present paper is the first attempt in Ukrainian linguistics to make
a comprehensive quantitative study of a particular literary work using
modern techniques. Previous word-indices of Ukrainian writers (Vascenko,
1964; Zovtobrjux, 1978-79; Kovalyk et al., 1990; Luk’janjuk, 2004) were
compiled manually; their aim was to establish the number of occurrences of
a particular word, not to analyze such data.

Some modest efforts in the quantitative study of works by Ivan Franko
were made recently for his fairy-tales, see (Holovatch and Palchykov, 2005).

The present study is based on the frequency dictionary compiled by the
authors from the edition (Franko, 1979) using principles consistent with
those described in our recent paper (Buk and Rovenchak, 2004). We have
also analyzed the main differences between this edition and the first
(separate) one (Franko, 1900).


2 Basic Principles of the Text Analysis 

We consider a token to be a word in any form (a letter or alphanumeric
sequence between two spaces), irrespective of the language. Thus, ‘1848’,
‘60-их’, and ‘§136’ were each treated as one token.
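The token definition above can be illustrated with a minimal Python sketch. It is not the procedure actually used for the dictionary: the function name `tokenize`, the punctuation set, and the sample sentence are assumptions made purely for illustration.

```python
# Illustrative sketch only: a whitespace tokenizer in the spirit of the
# definition above. The punctuation set and the name `tokenize` are
# assumptions, not part of the original study.
PUNCTUATION = ".,;:!?()[]\"«»…"


def tokenize(text):
    """Split on whitespace and strip surrounding punctuation from each chunk."""
    tokens = []
    for chunk in text.split():
        token = chunk.strip(PUNCTUATION)
        if token:  # skip chunks that consisted of punctuation only
            tokens.append(token)
    return tokens


# Hypothetical sample; '1848', '60-их' and '§136' each stay a single token.
sample = "У 1848 році, в 60-их роках; §136."
print(tokenize(sample))
```

Stripping only the surrounding punctuation keeps internal hyphens and the section sign, so mixed alphanumeric sequences are not split.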

We have partially restored the use of the letter ґ [g], eliminated from
the Ukrainian alphabet in 1933 during Stalin’s rule as a step toward
removing the differences between Ukrainian and Russian orthography. The
letter г was left to denote both the [ɦ] and [g] sounds, even though the
distinction can differentiate meaning: гніт ‘oppression’ versus ґніт
‘wick’, грати ‘to play’ versus ґрати ‘bars’, etc. The letter ґ was
rehabilitated in the Ukrainian orthography in 1993, but to a much narrower
extent.

We have tried to restore the use of this letter using the edition (Franko,
1900) and following modern Ukrainian orthographical tendencies: first, in
proper names, such as Реґіна ‘Regina’, Ваґман ‘Wagman’, Рессельберґ
‘Resselberg’; second, in loan words from Polish, German and Latin, such as
ґратулювати ← gratulować, ґешефт ← Geschäft, морґ ← Morgen (a measure of
area), абнеґація ← abnegation, etc.; and, of course, in those words which
are now traditionally written with ґ: ґанок, ґатунок, ґрасувати, ґрунт, etc.
2.1 Euphony

In Ukrainian, some words appear in different phonetic variants caused by
the ‘phonetic environment’ (i.e., the notion of euphony; cf. Polish w/we,
z/ze, Russian с/со, к/ко, also the English indefinite article a/an or
initial consonant mutations in Irish). These are the initial в/у, від/відо,
з/із/зі/зо, і/й (with the respective prepositions and conjunctions) and the
final -ся/-сь. Such word variants were joined under one (the most frequent)
form. Vernacular variants, by contrast, are listed separately, for
instance адука(н)т and адвокат (‘advocate’), переграф and параграф
(‘paragraph’); in each pair the second form is the normative one.
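Joining euphonic variants under the most frequent form can be sketched as follows. This is a hedged illustration, not the authors' tool: the function `merge_variants`, the variant groups, and the counts are all made up for the example.

```python
from collections import Counter


def merge_variants(counts, variant_groups):
    """Join each group of euphonic variants under its most frequent member."""
    merged = Counter(counts)
    for group in variant_groups:
        present = [form for form in group if form in merged]
        if len(present) < 2:
            continue  # nothing to merge for this group
        head = max(present, key=lambda form: merged[form])
        for form in present:
            if form != head:
                merged[head] += merged.pop(form)
    return merged


# Made-up counts, used only to show the mechanics.
counts = Counter({"в": 5, "у": 3, "і": 7, "й": 2, "на": 4})
groups = [("в", "у"), ("і", "й")]
print(merge_variants(counts, groups))
```

The total number of tokens is unchanged by the merge; only the number of distinct entries shrinks.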

2.2 Homonyms 

The problem of homonyms is one of the most complicated problems slowing
down automatic text processing. It is connected with the very high
frequency of auxiliary parts of speech that share the same form in Ukrainian
(as well as in other Slavic languages). For instance, in the text under
investigation there are 1956 occurrences of що, distributed as follows:
1360 as a conjunction (‘that’, ‘which’), 495 as a pronoun (‘what’) and 101
as a particle. The token а occurs 1065 times as a conjunction (‘and’,
‘but’), 33 times as a particle and 6 times as an interjection. The token
як is found 389 times as a conjunction (‘as’), 125 times as an adverb
(‘how’) and 55 times as a particle. Note, however, that the translations
are very approximate due to the wide range of the word meanings. The
‘full-meaning’ words occupy somewhat lower ranks: мати appears 35 times as
the verb ‘to have’ in the infinitive versus 4 occurrences as the noun
‘mother’ in the nominative singular.

While the above examples are standard and expected for Ukrainian, we
have also met some specific parallel forms: наймити, the noun ‘hireling’
in the nominative plural, and наймити, the verb ‘to hire’ in the
infinitive; густу, the adjective ‘dense’ in the feminine accusative
singular, and густу (in fact, ґусту), the noun ‘taste’ in the genitive
singular.

The analysis of the homonyms could not be carried out fully automatically;
even contextual analysis is not sufficient. Therefore, manual control was
necessary.

Interestingly, the problem of homonyms appears even in the small subset of
words written in the Latin script: we had to distinguish Latin and German
in, the German definite articles die (plural and feminine), and Latin
maxima (adjective, feminine of maximus, and noun, plural of maximum).


3 Statistical Data 


The text size N is 93,885 tokens. In the novel, 45 tokens are
alphanumeric and 208 are written in the Latin script: German (87), Latin
(55), Polish (38), French (14), Czech (9) and Yiddish (4), plus one
occurrence of the letter ‘S’ used to describe the shape of a river. All the
remaining tokens are Ukrainian.

The number of different word-forms is 19,391. The number of different 
words (lemmas) - vocabulary size V - is 9,962. 

Mean word length is 4.83 letters and mean sentence length is 9.7 words. 

Vocabulary richness (the variety index), calculated as the ratio of the
vocabulary size to the text size, equals V/N = 0.106.

Vocabulary density is calculated as the ratio of the text size to the
vocabulary size, N/V = 9.42. In other words, on average a new word is
encountered every 9-10 words read.

The number of hapax legomena $V_1$ is 4,902, making up 49.2 per cent of the
vocabulary and 5.22 per cent of the text. These parameters are also known
as the exclusiveness indices of the vocabulary and the text, respectively.

The concentration indices are connected with the number of words having
an absolute frequency of 10 or higher: $N_{10}$ = 74,692 for the text and
$V_{10}$ = 1,127 for the vocabulary. The concentration indices are therefore
$N_{10}/N$ = 79.6% for the text and $V_{10}/V$ = 11.3% for the vocabulary.
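The indices defined in this section are simple ratios, so they can be sketched in a few lines of Python. The function name `vocabulary_indices` and its dictionary keys are illustrative assumptions; the final lines merely re-derive the reported values from the totals quoted in the text.

```python
from collections import Counter


def vocabulary_indices(lemmas, threshold=10):
    """Compute the richness, density, exclusiveness and concentration indices."""
    freq = Counter(lemmas)
    n = sum(freq.values())                              # text size N
    v = len(freq)                                       # vocabulary size V
    v1 = sum(1 for c in freq.values() if c == 1)        # hapax legomena
    n_thr = sum(c for c in freq.values() if c >= threshold)
    v_thr = sum(1 for c in freq.values() if c >= threshold)
    return {
        "richness": v / n,
        "density": n / v,
        "exclusiveness_vocabulary": v1 / v,
        "exclusiveness_text": v1 / n,
        "concentration_text": n_thr / n,
        "concentration_vocabulary": v_thr / v,
    }


# Consistency check using only the totals reported in the text.
N, V, V1, N10, V10 = 93885, 9962, 4902, 74692, 1127
print(round(V / N, 3), round(N / V, 2))                  # richness, density
print(round(100 * V1 / V, 1), round(100 * V1 / N, 2))    # exclusiveness, %
print(round(100 * N10 / N, 1), round(100 * V10 / V, 1))  # concentration, %
```

Applied to a full list of lemmas, `vocabulary_indices` would return all six ratios at once; here only the published totals are available, so the check is done directly on them.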

The main feature of the edition (Franko, 1900) that influences the
statistical parameters of the novel in comparison with the modern text
(Franko, 1979) is the usage of the verbal reflexive particle -ся. In modern
Ukrainian it is written together with the respective verb, unlike in the
orthographical rules of 1900 (cf. also the shortened variant -сь, written
as one word with the verb in both older and modern texts). In the novel,
this particle is used 2496 times in 1485 different verbal forms. This
frequency would correspond to the second (!) highest rank, after і/й ‘and’
(3204). Such a result correlates with, e.g., modern Polish, where the
corresponding word się also belongs to the most frequent ones (PWN, 2005),
see Table 1.
4 Distributions and Linguistic Laws


We have analyzed the distribution of word-forms with respect to the number
of letters and found that this dependence has two maxima, see Fig. 1(a).
As the size of our sample is quite large, this fact may signify that some
other unit must be considered as the proper, or natural, one. A phoneme
(sound) appears to be such a unit.


Figure 1: The distribution of word-forms (fraction of unity, vertical axis) 
with respect to the number of constituting letters (a) and sounds (b). 


The fraction W of word-forms containing exactly $\varphi$ phonemes can be
approximated by the following (empirical) formula:

$$W = A\varphi^{b} e^{-a\varphi^{2}}, \qquad A = \frac{2a^{(b+1)/2}}{\Gamma\left(\frac{b+1}{2}\right)}, \tag{1}$$

where the value of A is obtained from the normalization condition

$$1 = \int_0^\infty A\varphi^{b} e^{-a\varphi^{2}}\, d\varphi. \tag{2}$$


The fitting parameters are as follows: b = 0.6347, a = 0.02579, see
Fig. 1(b). The fitting results in this work were obtained using the
nonlinear least-squares Marquardt-Levenberg algorithm implemented in the
gnuplot utility, version 3.7 for Linux.
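As a sanity check on the normalization constant, the integral in the normalization condition can be evaluated numerically. The sketch below assumes the form W = A φ^b exp(-a φ²) with A = 2a^((b+1)/2) / Γ((b+1)/2) and uses the fitted values quoted above; the helper names and the plain midpoint rule are illustrative choices, not the authors' gnuplot procedure.

```python
import math


def norm_constant(a, b):
    """A = 2 a^((b+1)/2) / Gamma((b+1)/2), as in Eq. (1)."""
    return 2.0 * a ** ((b + 1.0) / 2.0) / math.gamma((b + 1.0) / 2.0)


def phoneme_fraction(phi, a, b):
    """W(phi) = A phi^b exp(-a phi^2), as in Eq. (1)."""
    return norm_constant(a, b) * phi ** b * math.exp(-a * phi * phi)


def integrate_midpoint(f, lower, upper, steps):
    """Plain midpoint rule; adequate for a rapidly decaying integrand."""
    h = (upper - lower) / steps
    return h * sum(f(lower + (i + 0.5) * h) for i in range(steps))


a, b = 0.02579, 0.6347  # fitted values quoted in the text
total = integrate_midpoint(lambda x: phoneme_fraction(x, a, b), 0.0, 60.0, 60000)
print(total)  # should be close to 1 if Eq. (1) and Eq. (2) agree
```

The upper limit of 60 stands in for infinity; the Gaussian factor makes the neglected tail negligible for these parameter values.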

When a syllable is used as the length unit, we have utilized a formula
similar to the Altmann-Menzerath law (Altmann, 1980) but with the argument
shifted by unity:

$$W = B(s+1)^{d} e^{-\gamma(s+1)}, \qquad B = \frac{\gamma^{d+1}}{\Gamma(d+1)}. \tag{3}$$

In the above formula, W is the fraction of word-forms containing exactly s
syllables. The reason for introducing the shift is the high frequency of
non-syllabic words (particles б, ж, prepositions в, з, conjunction й),
which were not treated as proclitics, in contrast to, e.g., the approach
by Grzybek and Altmann (2002) for similar Russian words. We have set the
length of such words to zero. Thus, the distribution function has to be
non-zero at the origin (s = 0).

The fitting parameters are as follows: d = 5.805, $\gamma$ = 2.245, see
Fig. 2(a).

Figure 2: The distributions regarding the syllabic structure of the words:
the fraction of word-forms with respect to the number of constituting
syllables (a) and the evidence of Menzerath’s law (b); the right-most point
was excluded from the fit due to poor statistical reliability.
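The effect of shifting the argument by unity can be seen by evaluating the syllable distribution at small s. The sketch assumes that the exponent in Eq. (3) is -γ(s+1), which is the form consistent with the stated constant B = γ^(d+1)/Γ(d+1), and uses the fitted values d = 5.805 and γ = 2.245 quoted above; the function name is an illustrative choice.

```python
import math


def syllable_fraction(s, d, gamma):
    """W(s) = B (s+1)^d exp(-gamma (s+1)), with B = gamma^(d+1) / Gamma(d+1)."""
    norm = gamma ** (d + 1.0) / math.gamma(d + 1.0)
    x = s + 1.0  # argument shifted by unity
    return norm * x ** d * math.exp(-gamma * x)


d, gamma = 5.805, 2.245  # fitted values quoted in the text
for s in range(0, 7):
    print(s, round(syllable_fraction(s, d, gamma), 4))
```

Because of the shift, the value at s = 0 is strictly positive, which is what allows the fit to accommodate the non-syllabic words.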


In order to check the validity of Menzerath’s law we have also studied the
dependence of the mean syllable length M on the word length s (measured
in syllables). We have used the formula

$$M = M_\infty + Bs^{c}. \tag{4}$$

The constant $M_\infty$ denotes a possible asymptotic value of the mean
syllable length in a very long word; the exponent c is a negative number.
In this way, we also obtain an infinite value of the syllable length for
non-syllabic words (s = 0). The fitting parameters are as follows:
$M_\infty$ = 1.984, B = 1.464, c = -1.119.
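The behaviour of Eq. (4) at both ends of the range can be checked with a short sketch that uses only the fitted values quoted above. The function name and the explicit handling of s = 0 are illustrative assumptions.

```python
import math

M_INF, B, C = 1.984, 1.464, -1.119  # fitted values quoted in the text


def mean_syllable_length(s):
    """M(s) = M_inf + B * s**c, Eq. (4); infinite for non-syllabic words."""
    if s == 0:
        return math.inf  # s**c diverges at s = 0 because c < 0
    return M_INF + B * s ** C


for s in range(0, 8):
    print(s, mean_syllable_length(s))
```

For s = 1 the power term equals one, so the fitted mean length of a monosyllabic word is simply the sum of the asymptotic value and B; for growing s the values decrease toward the asymptote, as Menzerath’s law requires.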


The form $M = As^{b}e^{cs}$ (see, e.g., Köhler (2002)) gave a poorer fit,
leading in particular to a large mean syllable length for long words due
to the exponential increase.

We have calculated the parameters of Zipf’s law by fitting our frequency
data in different ranges of ranks. The frequency F of a word is connected
with its rank r via the simple relation $F(r) = A/r^{z}$. The values of
the exponent z can be related to different types of vocabulary. In
Fig. 3(a) one can see three such rank domains: 10 < r < 200 (z = 0.999),
200 < r < 1000 (z = 1.05), and r > 1000 (z = 1.20).
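Estimating a Zipf exponent within one rank domain can be sketched as below. Note the substitution: the fits reported here were obtained by nonlinear least squares in gnuplot, whereas this sketch uses an ordinary linear regression of ln F on ln r, a simpler stand-in. The data are synthetic and the function name is hypothetical.

```python
import math


def zipf_exponent(ranks, freqs):
    """Least-squares slope of ln F versus ln r; returns (z, A) for F = A / r**z."""
    xs = [math.log(r) for r in ranks]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)


# Synthetic data with a known exponent, restricted to one rank domain.
ranks = list(range(200, 1001))
freqs = [5000.0 / r ** 1.05 for r in ranks]
z, amplitude = zipf_exponent(ranks, freqs)
print(round(z, 3), round(amplitude, 1))
```

Restricting `ranks` to a sub-range is what makes it possible to obtain a separate exponent for each regime; on real frequency data the two fitting methods can give slightly different values because they weight the points differently.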

The parameters of the Zipf-Mandelbrot law $F(r) = A/(r + C)^{b}$ were also
calculated for the whole rank domain: A = 25000, b = 1.14, C = 5.2. See
Fig. 3(a).


Figure 3: The transition to different regimes in Zipf’s law (a) and text
coverage (b). In (a) the dashed-dotted line is the fit to the
Zipf-Mandelbrot law.
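The Zipf-Mandelbrot parameters quoted above (A = 25000, b = 1.14, C = 5.2) can be plugged into the formula to see what frequencies the fit predicts. This is a small illustrative sketch; the function and argument names are assumptions.

```python
def zipf_mandelbrot(rank, amplitude=25000.0, exponent=1.14, shift=5.2):
    """F(r) = A / (r + C)**b with the parameters quoted in the text."""
    return amplitude / (rank + shift) ** exponent


for rank in (1, 10, 100, 1000):
    print(rank, round(zipf_mandelbrot(rank)))
```

For r = 1 the fit gives roughly 3.1 thousand occurrences, which is of the same order as the 3204 occurrences reported for the most frequent word і/й.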


The portion of the text T covered by the first r ranked words can be
fitted by the dependence $T(r) = k \ln r + T_0$. While for 10 < r < 200 the
growth of the text coverage is characterized by k = 0.133, it slows a bit
for 200 < r < 2,000 with k = 0.1155, and even more for larger values of r,
with k = 0.0833 for r > 2,000. See Fig. 3(b) for details.
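As a quick worked illustration of the logarithmic dependence (simple arithmetic on the quoted slopes, not an additional result of the study), the coverage gained within one rank domain follows from the difference of two values of T(r), in which the constant term cancels:

$$T(r_2) - T(r_1) = k \ln\frac{r_2}{r_1}, \qquad T(200) - T(10) \approx 0.133 \ln 20 \approx 0.40, \qquad T(2000) - T(200) \approx 0.1155 \ln 10 \approx 0.27.$$

Under these fits, the words ranked between 10 and 200 would thus add roughly 40 per cent of the text, and those ranked between 200 and 2,000 about a further 27 per cent.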
5 Comparison


To complete our paper, we compare the top-ranked words in five different
languages (see Table 1). The Ukrainian text is the novel under
consideration, the English one is Ulysses by James Joyce (Ulysses, n.d.),
and the Japanese one is Kokoro by Natsume Sōseki (Ohtsuki, 2005). The
Russian data correspond to the vocabulary of Lermontov (FDL, n.d.), and
the Polish data are from PWN (2005).

As expected, the majority of these words are auxiliary parts of speech,
irrespective of the language. Interestingly, the lists based on a single
literary work (the Ukrainian, English and Japanese examples) share some
common features. Namely, the names of the characters have very high
frequencies, allowing them to reach the highest ranks, together with the
forms of address пан, Mr., さん. Also, the nouns denoting human body parts
are quite frequent; in particular, ‘hand’ is absent only from the Polish
list (the reason is probably the comparatively large fraction of
journalistic texts in the PWN corpus). These phenomena require additional
interlingual studies.
















Table 1: The top-ranked words. The column to the right of each word gives
its relative frequency in per cent.

| Rank | Ukrainian | % | Polish | % | Russian | % | English | % | Japanese | % |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | і/й | 3.420 | w/we | 3.237 | и | 4.117 | the | 5.653 | | 4.923 |
| | -ся | 2.659 | | | | | | | | |
| 2 | він | 2.632 | i | 2.589 | я | 3.207 | he | 3.356 | (v) | 3.966 |
| 3 | не | 2.394 | być | 2.104 | в | 2.523 | of | 3.074 | | 1.945 |
| 4 | в/у | 2.304 | się | 2.069 | он | 2.441 | and | 2.726 | | 1.695 |
| 5 | я | 1.842 | z/ze | 1.779 | не | 2.239 | a/an | 2.710 | | 1.490 |
| 6 | на | 1.606 | na | 1.689 | быть | 1.294 | be | 2.263 | | 1.331 |
| 7 | з/із/зі/зо | 1.598 | nie | 1.535 | на | 1.290 | to | 1.877 | | 1.101 |
| 8 | що (conj) | 1.449 | on | 1.437 | она | 1.260 | in | 1.869 | (a) | 1.090 |
| 9 | бути | 1.388 | do | 1.178 | ты | 1.158 | I | 1.381 | | 0.840 |
| 10 | той | 1.299 | ten | 1.109 | с | 1.079 | she | 1.104 | | 0.764 |
| 11 | сей/цей | 1.222 | to | 1.005 | как | 0.990 | that | 0.985 | | 0.749 |
| 12 | до | 1.143 | że | 0.853 | этот | 0.838 | with | 0.950 | | 0.639 |
| 13 | а (conj) | 1.134 | a | 0.773 | но | 0.762 | it | 0.892 | | 0.622 |
| 14 | вона | 0.962 | który | 0.650 | весь | 0.738 | on | 0.799 | | 0.619 |
| 15 | пан | 0.937 | o | 0.644 | мой | 0.722 | say | 0.756 | | 0.525 |
| 16 | ви | 0.898 | mieć | 0.585 | вы | 0.716 | for | 0.731 | | 0.517 |
| 17 | але | 0.749 | jak (adv) | 0.501 | они | 0.645 | have | 0.723 | | 0.483 |
| 18 | що (pron) | 0.685 | tak | 0.445 | что (conj) | 0.634 | you | 0.716 | | 0.477 |
| 19 | свій | 0.649 | ja | 0.441 | что (pron) | 0.557 | they | 0.639 | | 0.463 |
| 20 | (в/у)весь | 0.590 | co | 0.436 | тот | 0.551 | all | 0.500 | | 0.440 |
| 21 | вони | 0.559 | rok | 0.424 | к | 0.512 | at | 0.488 | | 0.406 |
| 22 | за | 0.536 | od | 0.398 | свой | 0.453 | by | 0.480 | | 0.395 |
| 23 | Євгеній | 0.534 | po | 0.394 | а (conj) | 0.444 | as | 0.448 | | 0.383 |
| 24 | знати | 0.456 | ale | 0.384 | так | 0.436 | from | 0.411 | | 0.372 |
| 25 | такий | 0.456 | taki | 0.325 | бы | 0.432 | do | 0.377 | | 0.369 |
| 26 | який | 0.445 | móc | 0.323 | один | 0.430 | or | 0.363 | | 0.366 |
| 27 | би/б | 0.425 | przez | 0.321 | за | 0.410 | Bloom | 0.351 | | 0.358 |
| 28 | як (conj) | 0.414 | za | 0.309 | мочь | 0.396 | out | 0.339 | | 0.349 |
| 29 | мати (v) | 0.406 | dla | 0.288 | мы | 0.393 | what | 0.338 | | 0.335 |
| 30 | про | 0.406 | już | 0.265 | у | 0.353 | not | 0.338 | | 0.332 |
| 31 | мовити | 0.395 | czy | 0.262 | же | 0.348 | my | 0.317 | | 0.309 |
| 32 | ще | 0.381 | bardzo | 0.262 | знать | 0.329 | up | 0.313 | | 0.301 |
| 33 | себе | 0.379 | tylko | 0.255 | сказать | 0.322 | one | 0.279 | | 0.301 |
| 34 | ну | 0.374 | swój | 0.241 | твой | 0.304 | like | 0.277 | | 0.295 |
| 35 | ж/же | 0.373 | no | 0.240 | от | 0.301 | their | 0.273 | | 0.290 |
| 36 | коли | 0.371 | to | 0.232 | нет | 0.294 | Mr. | 0.271 | | 0.287 |
| 37 | могти | 0.367 | wszystko | 0.227 | по | 0.292 | there | 0.266 | | 0.281 |
| 38 | по | 0.365 | wiedzieć | 0.215 | ли | 0.275 | but | 0.263 | | 0.275 |
| 39 | ми | 0.361 | inny | 0.207 | рука | 0.275 | no | 0.258 | | 0.267 |
| 40 | то (conj) | 0.349 | bo | 0.203 | который | 0.274 | come | 0.245 | | 0.261 |
| 41 | ти | 0.343 | czas | 0.187 | когда | 0.273 | so | 0.233 | | 0.258 |
| 42 | від | 0.342 | człowiek | 0.187 | из | 0.269 | then | 0.219 | | 0.253 |
| 43 | один | 0.328 | sam | 0.184 | ни | 0.266 | when | 0.209 | | 0.247 |
| 44 | так (adv) | 0.323 | praca | 0.177 | любить | 0.264 | man | 0.209 | | 0.230 |
| 45 | мій | 0.320 | oraz | 0.176 | уже | 0.257 | if | 0.205 | | 0.230 |
| 46 | щоб/щоби | 0.304 | jeden | 0.175 | хотеть | 0.248 | about | 0.202 | | 0.224 |
| 47 | сам | 0.291 | mówić | 0.174 | о (prep) | 0.248 | which | 0.196 | | 0.221 |
| 48 | Стальський | 0.278 | też | 0.168 | душа | 0.242 | Stephen | 0.191 | | 0.219 |
| 49 | говорити | 0.274 | lub | 0.168 | кто | 0.237 | old | 0.185 | | 0.213 |
| 50 | то (part) | 0.269 | jeszcze | 0.164 | для | 0.230 | your | 0.184 | | 0.210 |
| 51 | та (conj) | 0.258 | przy | 0.161 | если | 0.228 | who | 0.182 | | 0.210 |
| 52 | тілько/-и | 0.256 | przed(e) | 0.154 | чтобы | 0.222 | hand | 0.175 | | 0.207 |
| 53 | ні | 0.245 | my | 0.153 | о (int) | 0.219 | down | 0.171 | | 0.202 |
| 54 | для | 0.244 | pan | 0.153 | говорить | 0.214 | this | 0.169 | | 0.202 |
| 55 | рука | 0.241 | można | 0.152 | себя | 0.214 | over | 0.168 | | 0.199 |